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Abstract: Randomized sampling has recently been demonstrated to 
be an efficient technique for computing approximate low-rank factor- 
izations of matrices for which fast methods for computing matrix vec- 
tor products are available. This paper describes an extension of such 
techniques to a wider class of matrices that are not themselves rank- 
qq" ■ deficient, but have off-diagonal blocks that are. Such matrices arise 

C frequently in numerical analysis and signal processing, and there exist 

several methods for rapidly performing algebraic operations (matrix- 
vector multiplications, matrix factorizations, matrix inversion, etc) on 
them once low-rank approximations to all off-diagonal blocks have been 
constructed. The paper demonstrates that if such a matrix can be ap- 
, plied to a vector in O(N) time, where the matrix is of size N xN, and if 

individual entries of the matrix can be computed rapidly, then in many 
cases, the task of constructing approximate low-rank factorizations for 
\ all off-diagonal blocks can be performed in 0(N k 2 ) time, where k is 

■ an upper bound for the numerical rank of the off-diagonal blocks. 
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1. Introduction 



There has recently been much interest in the development of fast algorithms for struc- 
tured matrices of different varieties. One class of such matrices is the class of "Hierarchically 
Semi-Separable" (HSS) matrices, [U[25l[7j. These matrices are characterized by a specific 
type of rank deficiencies in their off-diagonal blocks (as described in Section 12. 4h and arise 
upon the discretization of many of the integral operators of mathematical physics, in signal 
processing, in algorithms for inverting certain finite element matrices, and in many other 
applications, see e.g. \25 \ [6l [22 ], I20|. The common occurrence of such matrices in scientific 
\q . computing motivates the development in [261 US EH E] of fast algorithms for performing 

operations such as matrix-vector multiplies, matrix factorizations, matrix inversions, etc. 

There currently is little consistency in terminology in discussing structured matrices. 
The property that we here refer to as the "HSS" property also arises under different names 
in a range of other publications, for instance [191 12H [26], [22]. It is also closely related to 
^ the '"H 2 -matrices" discussed in [161 HI 0j- The methods described in the present paper are 

directly applicable to the structures described in [191 [2H [22] , an d with minor modifications 
to the structures in [161 HI E] • 

The observation that many matrices that arise in scientific computing have off-diagonal 
blocks that can be approximated well by low-rank matrices underlies many "fast" methods 
such as the Fast Multipole Method [121 E] , panel clustering [15] , Barnes-Hut [1] , ^-matrices 
[17] . etc. It is important to note that the HSS property imposes stronger conditions on the 
off-diagonal blocks than any of the algorithms listed. We describe these conditions in detail 
in Section 12.41 but loosely speaking, the HSS property requires both that large blocks 
directly adjacent to the diagonal can be approximated by low rank matrices, and that the 
basis functions used be "nested", i.e. that the basis functions on one level be expressed as 
linear combinations of the basis functions on the next finer level. The benefits obtained by 
imposing these stronger conditions in part derive from faster algorithms for matrix-vector 
multiplies [22], but more importantly from the fact that they allow other linear algebraic 
operations such as matrix factorizations and inversions to be performed in O(N) time 
[ZlCElG]. However, while there exist very fast algorithms for manipulating such matrices 
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once the factors in the HSS representation are given, it is less well understood how to 
rapidly compute these factors in the first place. For matrices arising from the discretization 
of the boundary integral equations of mathematical physics, [21] describes a technique that 
is O(N) in two dimensions and 0(N 3 / 2 ) in three. In other environments, it is possible 
to use known regularity properties of the off-diagonal blocks in conjunction with standard 
interpolation techniques to obtain rough initial factorizations, and then recompress these 
to obtain factorizations with close to optimal ranks [3]. An approach of this kind with an 
0(N log N) or 0(N log 2 N) complexity, depending on circumstances, is used in |22j . 

The purpose of the present paper is to describe a fast and simple randomized technique 
for computing all factors in the HSS representation of a large class of matrices. It works 
in any environment in which a fast matrix-vector multiplier is available (for instance, an 
implementation of the Fast Multipole Method, or some other legacy code) and it is possible 
and affordable to compute a small number of actual matrix elements. In order to describe 
the cost of the algorithm precisely, we must introduce some notation: We let A be an 
N x N matrix whose off-diagonal blocks have maximal rank k (in the "HSS" -sense, see 
Section f2 . 4 1) . we let T mu i t denote the time required to perform a matrix- vector multiplication 
x i — > Ax, we let T ran( j denote the cost of constructing a pseudo random number from a 
normalized Gaussian distribution, we let T en t ry denote the computational cost of evaluating 
an individual entry of A, and Tfj op denote the cost of a floating point operation. The 
computational cost T tota i of the algorithm then satisfies 

(1.1) T total ~ T mult x 2 {k + 10) + T rand x N {k + 10) + T cntry x 2 N k + T flop x c N k 2 , 

where c is a small constant. In particular, if T mu ] t is O(N), then the method presented here 
is O(N) as well. 

The technique described in this paper utilizes recently published methods for computing 
approximate low-rank factorizations of matrices that are based on randomized sampling 
|231ll8j. As a consequence, there is a finite probability that the method described here may 
fail in any given realization of the algorithm. This failure probability is a user specified 
parameter that in principle could be balanced against computational cost. In practice, 
the probability of failure can at a low cost be made entirely negligible. To be precise, in 
equation (11. ip . the number "10" in the first and the second terms on the right hand side is 
chosen to yield a failure probability that is provably less than 10 -5 , and appears to actually 
be much smaller still. We note that when the method does not fail, the accuracy of the 
randomized scheme is very high; in the environment described in this paper, relative errors 
of less than 10 -10 are easily obtained. 

2. Preliminaries 

In this section, we introduce some notation, and list a number of known results regarding 
low rank factorizations, and hierarchical factorizations of matrices. 

2.1. Notation. Throughout the paper, we measure vectors in M. n using their Euclidean 
norm, and matrices using the corresponding operator norm. 

For an m x n matrix A, and an integer k = 1, 2, . . . , min(m, n), we let <Tk{A) (or simply 
Ok when it is obvious which matrix is being referred to) denote the /c'th singular value of 
A. We assume that these are ordered so that (J\{A) > a2^A) > • • • > c min ( mn ) (A) > 0. We 
say that a matrix A has "e-rank" k if ak+i(A) < e. 

We use the notation of Golub and Van Loan [11] to specify submatrices. In other words, 
if A is an m x n matrix with entries , and I = [i± t %2, ■ ■ ■ , ik] and J = J2, • • • , ji] are 



3 



two index vectors, then we let A(I, J) denote the k x / matrix 

ilJ2 ' ' ' a hji 

ai k j 1 a,i k j 2 ■ ■ ■ ai k j l 

We let the shorthand A(I, :) denote the matrix A(I, [1, 2, ... , n]), and define J) anal- 
ogously. 

Given a set of matrices {Xj} l j =1 we let diag(Xi, X 2 , . . . , X{) denote the block diagonal 
matrix 

X 1 ••• 
x 2 ••• 
x 3 ••• 

••• x t 

2.2. Low rank factorizations. We say that an m x n matrix A has exact rank k if there 
exist anmxi matrix E and a k x n matrix F such that 

A = EF. 

In this paper, we will utilize three standard matrix factorizations. In describing them, we 
let A denote an m x n matrix of rank k. The first is the so called "QR" factorization: 

A = QR 

where Q is an m x k matrix whose columns are orthonormal, and R is a k x n matrix 
with the property that a permutation of its columns is upper triangular. The second is the 
"singular value decomposition" (SVD): 

A = UDV\ 

where the mxk matrix U and the nxk matrix V have orthonormal columns, and the kxk 
matrix D is diagonal. The third factorization is the so called "interpolatory decomposition" : 

A = A(:,J)X, 

where J is a vector of indices marking k of the columns of A, and the k x n matrix X 
has the kxk identity matrix has a submatrix and has the property that all its entries 
are bounded by 1 in magnitude. In other words, the interpolatory decomposition picks k 
columns of A as a basis for the column space of A and expresses the remaining columns in 
terms of the chosen ones. 

The existence for all matrices of the QR factorization and the SVD are well-known, as 
are techniques for computing them accurately and stably, see e.g. [TT]. The interpolatory 
decomposition is slightly less well known but it too always exists, and there are stable and 
accurate techniques for computing it, see e.g. |14^ [9]. (Practical algorithms for computing 
the interpolatory decomposition may produce a matrix X whose elements slightly exceed 1 
in magnitude.) In the pseudo code we use to describe the methods of this paper, we refer 
to such algorithms as follows: 

[Q, R] = qr(A), [U, D, V] = svd(A), [X, J] = interpolate(A). 

In the applications under consideration in this paper, matrices that arise are typically 
only approximately of low rank. Moreover, their approximate ranks are generally not known 



A(I,J) 
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a priori. As a consequence, the algorithms will typically invoke versions of the factorization 
algorithms that take the computational accuracy e as an input parameter. For instance, 

[U, D, V] = svd(A, e) 

results in matrices U, D, and V of sizes m x k, n x k, and k x k, such that 

\\UDV t -A\\ < e. 

In this case, the number k is the e-rank of A, and is of course an output of the algorithm. The 
corresponding functions for computing an approximate QR factorization or an interpolatory 
decomposition are denoted 

[Q,R] = qr(A, e), [X, J] = interpolate^, e). 

Remark 2.1. Standard techniques for computing partial QR and interpolatory factoriza- 
tions (Gram- Schmidt, Householder, etc) produce results that are guaranteed in the Frobe- 
nius norm (or some related matrix norm that can be computed via 0(mn) methods). It 
is possible to modify such techniques to measure remainders in the Z 2 -operator norm, but 
we have found no need to utilize such techniques since in all the applications that we have 
studied so far, the relevant matrix norms are excellent predictors for each other. 

2.3. Construction of low-rank approximations via randomized sampling. Let A 

be a given m x n matrix that we know can accurately be approximated by a matrix of 
rank k (which we do not know), and suppose that we seek to determine a matrix Q with 
orthonormal columns (as few as possible) such that 

\\A-QQ l A\\ 

is small. (In other words, we seek a matrix Q whose columns form an approximate ON-basis 
for the column space of A.) When we have access to a fast technique for computing matrix 
vector products x i-> Ax, this task can efficiently be solved using randomized sampling via 
the following steps: 

(1) Pick an integer I that is slightly larger than k (the choice I = k + 10 will turn out 
to be a good one). 

(2) Form an n x I matrix R whose entries are drawn independently from a normalized 
Gaussian distribution. 

(3) Form the product S = A R. 

(4) Construct a matrix Q whose columns form an ON-basis for the columns of S. 
Note that each column of the "sample" matrix S is a random linear combination of the 
columns of A. We would therefore expect the algorithm described to have a high probability 
of producing an accurate result provided that I is sufficiently much larger than k. It is 
perhaps less obvious that this probability depends only on the difference between I and k 
(not on m or n, or any other properties of A), and that it approaches 1 extremely rapidly 
as I — k increases. The details are given in the following theorem from |23| : 

Theorem 2.1. Let A be an m x n matrix, and let I and k be integers such that I > k. Let 
R be an n x I matrix whose entries are drawn independently from a normalized Gaussian 
distribution. Let Q be an m x I matrix whose columns form an ON-basis for the columns 
of AR. Let (Jfc+i denote the minimal error in approximating A by a matrix of rank k: 

o~k+i = m i n 1 1 ^4 — B\\. 

rank(B)=k 



Then 
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with probability at least 

l-tpfl-k), 

where ip is a decreasing function satisfying, for instance, <p(8) < 10 -5 and <£>(20) < 1CP 17 . 

In this paper, we always set I = k+10; note however that I— k is a user defined parameter 
that can be set to balance the risk of failure against computational cost. 

Remark 2.2. In practical applications, it is in Step (4) of the algorithm described above 
sufficient to construct an ON-basis for the columns of the sample matrix S that is accurate 
to precision e. Under very moderate conditions on decay of the singular values of A, the 
number of basis vectors actually constructed will be extremely close to the optimal number, 
regardless of the number I. 

Remark 2.3. The approximate rank k is rarely known in advance. In a situation where a 
single matrix A is to be analyzed, it is a straight-forward matter to modify the algorithm 
described here to an algorithm that adaptively determines the numerical rank by generating 
a sequence of samples from the column space of A and simply stopping when no more 
information is added. In the application we have in mind in this paper, however, the 
randomized scheme will be used in such a way that a single random matrix will be used 
to create samples of a large set of different matrices. In this case, we choose a number I 
of random samples that we are confident exceeds the numerical rank of all the matrices by 
at least 10. Note that the bases constructed for the various matrices will have the correct 
number of elements due to the observation described in Remark 12 .21 

Remark 2.4. The randomized sampling technique is particularly effective when used in 
conjunction with the interpolatory decomposition. To illustrate, let us suppose that A is an 
n x n matrix of rank k for which we can rapidly evaluate the maps x i— > A x and x i— > A t x. 
Using the randomized sampling technique, we then construct matrices S°° l = AR and 
grow _ j^t wnose columns span the column and the row spaces of A, respectively. If we 
seek to construct a factorization of A without using the interpolatory decomposition, we 
would then orthonormalize the columns of S co1 and S row , 

[Q co \ Y co1 } = qr(5 co1 ), and [Q row , y row ] = qr(S row ), 

whence 

(2.1) A = Q co1 ((Q™ 1 ) 1 A Q row ) Q row . 

Note that the evaluation of (|2.1[) requires k matrix-vector multiplies involving the large 
matrix A in order to compute the k x k matrix ((J™ 1 )* A Q row . Using the interpolatory 
decomposition instead, we simply determine the k rows of S co1 and S row that span their 
respective row spaces, 

[X co \ J co1 ] = interpolate^™ 1 )*), and [X row , J row ] = interpolate^™)*). 

Then we immediately obtain the factorization 

(2.2) A = X co1 A{J co \ J row ) (X xaw )\ 

Note that the factorization (|2.2[) is obtained by simply extracting the k x k submatrix 
A(J co \ J row ) from A. 

Remark 2.5. In practical application, the entries of the random matrix R are not "true" 
random numbers, but numbers from a "pseudo random number generator". Empirical 
experiments indicate that the algorithm is not at all sensitive to the quality of the random 
number generator. 
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(0,1) 

Level I 1 

(1,1) (1,2) 
Level 1 | — 1 — 1 

t 19 , (2,1) , (2,2) (2,3) (2,4) 
Level 2 | 1 1 1 1 

Level3 , (3, 1) | (3, 2) | (3, 3) | (3, 4) | (3, 5) | (3, 6) | (3, 7) | (3, 8) | 
Figure 2.1. The binary tree of nodes. 



2.4. Hierarchically Semi-Separable matrices. In this section, we define the class of 
"Hierarchically Semi-Separable" (HSS) matrices, and introduce notation for keeping track 
of various blocks of structured matrices. A more detailed discussion of this topic can be 
found in e.g. [7] [25]. 

In order to define the HSS property for an N x N matrix A, we first partition the index 
vector / = [1, 2, . . . , N] in a hierarchy of index sets. For simplicity, we limit attention 
to binary tree structures in which every level is fully populated. We use the integers 
p = 0, 1, P to label the different levels, with P denoting the finest level. Thus at level 
P, we partition / into 2 P disjoint vectors {/(pj)}| =1 so that 

1 = [^(P,l), ^(P,2), • • • , I(P,2 p )\- 

We then merge the index vectors by twos, so that for each p = 0, 1, . . . , P — 1, and each 
j = 1, 2, . . . , 2 P , we have 

hp J) = 0+1,2.7-1), J (p+l,2j)]- 

For a given node r = (p,j), we call the two nodes o\ = {p+ 1, 2j — 1) and 02 = (p+ 1, 2j) the 
"children" of r, and we say that a\ and a"2 are "siblings". The tree structure is illustrated 
in Figure [27T1 

For any node r = (p,j) in the tree, we define the corresponding diagonal block of A via 

D T = A{I T , I T ), 

and let denote the N x N matrix with the matrices {-D( P j)}|Li as its diagonal blocks, 
(2.3) r» = diag(£> (Pjl) , D^, D (P)2P) ). 

For a node r = (p,j), we now define the corresponding "HSS row block" A T ° W and "HSS 
column block" A^ by 

A* T ™ = A(I T , :) - r»(/ T , :), and Af = A(:, I T ) - D^(:,I T ). 

These definitions are illustrated in Figure [221 

Definition 2.1. A matrix A is an "HSS matrix" if for some given positive integer k, every 
HSS block has rank at most k. 

It is convenient to construct factorizations for the HSS blocks that are "nested" in the 
sense that the bases on one level are expressed in terms of the bases on the next finer level. 
To express this concept in formulas, we suppose that for each node r, U T denotes a matrix 
whose columns form a basis for the column space of the HSS row block A r ° w . It is possible 
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(b) 

Figure 2.2. (a) The matrix A — with non-zero parts shaded. The HSS 
row block A T ,™ 6 -. is marked with a thick border, (b) The matrix A — 

with A 1 ^ marked, (c) The matrix A - with A 1 ^ marked. 



to construct the set of bases {U T } in such a way that for every non-leaf node r, there exists 
a 2k x k matrix U T such that 



(2.4) 



U, 











where o\ and 02 denote the two children of r. We analogously construct a set {V^-} of bases 
for the row spaces of the HSS column blocks A c ° l for which there exist 2k x k matrices V T 
such that 

" 



Vr = 



"(71 




V. 



(T 2 



Vr. 



Next, let {(Ti, CT2} denote a sibling pair in the tree, and consider the offdiagonal block 



A 



0\02 



Since ^4(t 10 - 2 is a submatrix of the HSS row block A™™ and the HSS column block A a °^, 
there must exist a k x k matrix B aia2 such that 



A 



U ai B crir72 



An HSS matrix A is completely described if for every leaf node r, we are given the 
matrices D T and the basis matrices U T and V T , if for all sibling pairs {a±, (T2} we are given 
the matrices B aia2 , and if for each non-leaf node r we are given the matrices U T and V T . 
In particular, given a vector x, the vector b = Ax can be evaluated via the following steps: 

(1) For every leaf node r, calculate x T = V* x(I T ). 



(2) Looping over all non-leaf nodes r, from finer to coarser, calculate x T = V^ 

where a\ and o"2 are the children of r. 

(3) Looping over all non-leaf nodes r, from coarser to finer, calculate 
B n 



Xq 

X n 



CT1<T2 f 1 + U T b T (where b T = for the root node). 

(4) For every leaf node r, calculate b(I T ) = U T b T + D T x(I T ). 
The computational cost of performing these steps is 0(N k). 

Remark 2.6. The matrix A can be expressed in terms of the matrices D T , U T , V T , and 
B aia2 as a telescoping factorization. To demonstrate this, we introduce for each level 







. ~b C2 _ 
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p = 1, 2, . . . , P the block-diagonal matrices 
[/(p) = diag(C/ (Pi i ) , U (P)2 ), U {P)2 p)), and V ip) = diag(V" (Pil) , V^ 2 ), V( P) 2p))- 
Moreover, we define for each non-leaf node r the 2k x 2k matrices 

R n 

f7 2 cr l 

where <ti and c"2 are the children of r, and set for p = 0, 1, . . . , P — 1 

-B (p) = diag(5 (p l ), B(p >2 ), • • • , B { J ,,v>))- 
Finally, recall from (|2.3|) that 

D (p) = diag(£> (Pjl) , D (Pj2) , D {P ^ P) ). 
For simplicity, we give the factorization for the specific value P = 3: 

(2.5) A = (tf(2) ^(1) B ( ) (y (l) )t + (y (2) )t + B (2)J (y(3)y + ^,(3) 



The block structure of the formula (|2.5p is as follows: 




Remark 2.7. For notational simplicity, we consider only the case where all HSS blocks 
are approximated by factorizations of the same rank k. In practice, it is a simple matter 
to implement algorithms that use variable ranks. 

Remark 2.8. It is common to require the matrices U T and V T that arise in an HSS fac- 
torization of a given matrix to have orthonormal columns. We have found it convenient to 
relax this assumption and allow the use of other well-conditioned bases. In particular, the 
use of interpolative decompositions (as described in Section 12. 2p is essential to the perfor- 
mance of the fast factorization technique described in Section 13.11 A simple algorithm for 
switching between the two formulations (using orthonormal bases, or interpolatory ones) is 
described in Section 13.21 



3. Fast computation of HSS approximations 

The straight-forward way of computing the HSS factorization of matrix would be to form 
all HSS blocks, and then perform dense linear algebra operations on them to construct the 
required basis functions. This approach would require at least 0(N 2 k) algebraic operations 
to factorize an N x N matrix of HSS rank k. In this section, we describe how the randomized 
sampling techniques described in Section [2731 can be used to reduce this cost to 0(N k 2 ). 

The technique described relies crucially on the use of the interpolatory decompositions 
described in Section T2.2I in the HSS factorization. The advantage is that the matrices B ai(T2 
are then submatrices of the original matrix A and can therefore be constructed directly 
without a need for projecting the larger blocks onto the bases chosen, cf. Remark 12.41 

Section [3711 descr ib es a scheme for rapidly computing the HSS factorization of a symmetric 
matrix. The scheme described in Section [3~H results in a factorization based on interpolatory 
bases and the blocks B aia2 are submatrices of the original matrix; Section 13.21 describes 
how such a factorization can be converted to one in which the bases for the HSS blocks are 
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orthonormal, and the blocks B aia2 are diagonal. Section 13.31 describes how to extend the 
methods to non-symmetric matrices. 

3.1. A scheme for computing an HSS factorization of a symmetric matrix. Let 

A be an N X N symmetric matrix that has an HSS factorization of rank k. Suppose further 
that: 

(a) Matrix vector products x i— ► Ax can be evaluated at a cost T mu i t . 

(b) Individual entries of A can be evaluated at a cost T entry . 

In this section, we will describe a scheme for computing an HSS factorization of A in time 

Ttotai ~ T mult x (k + 10) + T rand x N (k + 10) + T cntry x 2 N k + T flop x c N k 2 , 

where T ran( i is the time required to generate a pseudo random number, Tfl op is the CPU time 
requirement for a floating point operation and c is a small number that does not depend 
on TV or k. 

The core idea of the method is to construct an N x (k + 10) random matrix R, and then 
construct for each level p, the "sample" matrices 

S(p) = (A- D&) R, 

via a procedure to be described. Then for any cell r on level p, 

S ip \l T ,:)=A r T ow R, 

and since A™ w has rank k, the columns of (I T , :) span the column space of A™ w according 
to Theorem 12.11 We can then construct a basis for the column space of the large matrix 
A™ w by analyzing the small matrix Si p \l T , :). 

What makes the procedure fast is that the sample matrices can be constructed by 
means of an O(N) process from the result of applying the entire matrix A to R, 

S = AR. 

At the finest level, p = P, we directly obtain S^- p ) from S by simply subtracting the 
contribution from the diagonal blocks of A, 

(3.1) S (p) = S- D (p) R. 

Since D^ p ) is block diagonal with small blocks, equation (|3.ip can be evaluated cheaply. To 
proceed to the next coarser level, p = P — 1, we observe that 

(3.2) S( p -V = {A- D( p -V) R=(A- D^) R - (D^ - D^) R 

= S^-{D^-D^)R. 

Now (D( p -V -£>( p )) has only 2 P non-zero blocks. The pattern of these blocks is illustrated 
for P = 3 below: 




Each of these blocks were compressed in the computation at level P so (13. 2ft can also be 
evaluated rapidly. The algorithm then proceeds up towards coarser levels via the formula 

S (P-1) = s (p) _ (£,(p-i) _ D (p) )R, 
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Input: A fast means of computing matrix- vector products x i— > Ax. 
A method for computing individual entries of A. 
The HSS-rank k of A. 

A partitioning { I(pj) }j=i of the interval [1, 2, ... , N). 
Output: Matrices U T , B aia2 , D T that form an HSS factorization of A. 
(Note that V T = U T for a symmetric matrix.) 



Generate an N x (k + 10) Gaussian random matrix R. 
Evaluate S = A R using the fast matrix- vector multiplier, 
loop over levels, finer to coarser, p = P : (—1) : 1 
loop over all nodes r on level p 
if r is a leaf node then 

-^loc It 

Rloc = R{Iti '■) 

Sioc = S(I T , :) — A(I T ,I T ) R\ oc 



else 



Let o"i and 02 be the two children of r. 
hoc 



[I(T 1 1 I(T 2 



R 



loc 



R a 
R„ 



Slo 



s, 



a 2 



1 I(T2 I 

A(Ia 2 ; I<T\ 1 



a 2 



end if 

[U T , J T ] = interpolate(S' 1 t oc ) 

R T = U T Rioc 

S T = Sioc{Jti 
It — Iioc{Jt) 

end loop 
end loop 

For all leaf nodes r, set D T = A(I T , I T ). 
For all sibling pairs {a%, 02} set B ai(T2 = 



A(I t 



>T I, ) I&2 J 



Algorithm 1 : Computing the HSS factorization of a symmetric matrix. 



which can be evaluated rapidly since the blocks of (D^ p ~^ — D^) have at this point been 
compressed. 

The condition that the bases be "nested" in the sense of formula (|2.4p can conveniently 
be enforced by using the interpolative decompositions described in Section [2T2t At the finest 
level, we pick k rows of each HSS row block that span its row space. At the next coarser 
level, we pick in each HSS row block k rows that span its row space out of the 2k rows that 
span its two children. By proceeding analogously throughout the upwards pass, (|2.4p will 
be satisfied. 

The use of interpolatory decompositions has the additional benefit that we do not need 
to form the entire matrices S^' when p < P. Instead, we work with the submatrices formed 
by keeping only the rows of corresponding to the spanning rows at that step. 

A complete description of the methods is given with the caption "Algorithm 1" . 

Remark 3.1. For simplicity, Algorithm 1 is described for the case where the off-diagonal 
blocks of A has exact rank at most k, and the number k is known in advance. In actual 
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applications, one typically is given a matrix A whose off-diagonal blocks are not necessarily 
rank-deficient in an exact sense, but can to high accuracy be approximated by low-rank 
matrices. In this case, Algorithm 1 needs to be modified slightly to take as an input the 
computational accuracy e instead of the rank k, and the line 

[U T , J T ] = interpolate^^) 

needs to be replaced by the line 



[U T , J T ] = interpolate^ 



loc 



This directly leads to a variable rank algorithm that is typically far more efficient than the 
fixed rank algorithm described. 

3.2. Recompression into orthonormal basis functions. In this section, we describe a 
simple post-processing step that converts the HSS factorization resulting from Algorithm 1 
(which is based on interpolatory factorizations) into a factorization with orthonormal basis 
functions. This process also diagonalizes all matrices B ai(T2 . 

Suppose that the matrices U T , D T , and B U1(72 in an HSS factorization of a symmetric 
matrices have already been generated (for instance by the algorithm of Section [3. 1 j) . The 
method described here produces new matrices [/° ew and B™™ 2 with the property that each 
jjnew j-^g or thonormal columns, and each B^™ 2 is diagonal. 

Remark 3.2. The method described here does not in any way rely on particular properties 
of the interpolative decomposition. In fact, it works for any input matrices U T , D T , B aia2 
for which the factorization (12.51) holds. 



The orthonormalization procedure works hierarchically, starting at the finest level and 
working upwards. At the finest level, it loops over all sibling pairs {a\, a 2 }. It orthonor- 
malizes the basis matrices U ai and TJ a% by computing their QR factorizations, 

[Wt, Rt] = qx(U ai ) and [W 2 , R 2 ] = qr(U n ), 

so that 

U ai = Wi R x and U a2 = W 2 R 2 , 

and W\ and W 2 have orthonormal columns. The matrices R\ and R 2 are then used to 
update the diagonal block B aia2 to reflect the change in basis vectors, 

B\ 2 = Ri -B,j lCT2 R 2 . 

Then B\ 2 is diagonalized via a singular value decomposition, 

B 12 = WiB%™ 2 W 2 . 

The new bases for o\ and a 2 are constructed by updating W\ and W 2 to reflect the diago- 
nalization of Bi 2 , 

U™ w = Wi Wx , and U™ w = W 2 W 2 . 

Finally, the basis vectors for the parent r of o\ and o 2 must be updated to reflect the 
change in bases at the finer level, 



Ur 



W\ R x 
W\ R 2 



Once the finest level has been processed, simply move up to the next coarser one and 
proceed analogously. A complete description of the recompression scheme is given with the 
caption "Algorithm 2". 
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Input: The matrices U T , B aia2 , D T in an HSS factorization of a symmetric matrix A. 
Output: Matrices J7° cw , -B"™ , and D T that form an HSS factorization of A such that 

all U^ cw have orthonormal columns and all B^™ 2 are diagonal. 

(The matrices D T remain unchanged.) 



Set C/* mp = U T for all leaf nodes r. 
loop over levels, finer to coarser, p = P — 1, P — 2, 
loop over all nodes r on level p 

Let a i and oi denote the two sons of r. 

[H/ 1 ,i? 1 ]=qr(L>* r 1 np ) 

[W 2 , i? 2 ] 



qr(tC P ) 



Wi , B2f£., W9I = svd( J R 1 5 CT1(T2 i^) 



^ '1 

/Vnew 

rrnew 
U cr 2 

imp 



I7r 

end loop 
end loop 



'<T1(T2' ,/2 J 

= W 2 H> 2 




Remark: In practice, we let the matrices U T and [7^ ew simply overwrite C/ T . 

Algorithm 2: Orthonormalizing an HSS factorization 



3.3. Non-symmetric matrices. The extension of Algorithms 1 and 2 to the case of non- 
symmetric matrices is straight-forward. In Algorithm 1, we construct a set of sample 
matrices {S T } with the property that the columns of each S T span the column space of the 
corresponding HSS row block A T ° W . Since A is in that case symmetric, the columns of 5 T 
automatically span the row space of A c ° x as well. For non-symmetric matrices, we need to 
construct different sample matrices Sl° w and S^° l whose columns span the column space 
of A™ w and the row space of A™ 1 , respectively. Note that in practice, we work only with 
the subsets of all these matrices formed by the respective spanning rows and columns; in 
the non-symmetric case, these may be different. The algorithm is described in full with the 
caption "Algorithm 3". 

The generalization of the orthonormalization technique is entirely analogous. 
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rrow _ 
i loc " 
prow 

crow 
loc 



rcol _ 
J loc " 
pcol 

ccol . 
^loc 



S CO \lr,-) 



A(I T ,I T fR& 



rrow _ 
i loc " 

prow 



crow 

^loc - 



row rrow] 

(71 ) J (T2 

nrow 

prow 

Crow 

crow 
°(T2 



rcol _ 
"Moc " 

pcol 
-"loc 



rcol rcol 



rjco 
" ^col - 



JCOll 
(72 J 



row 

?l 
row 
CT2 



jcol\ 

(72 > 
JC 1\ 



interpolate^^)*) 



t prow 
^loc 
row . \ 



end if 

[X T T ow , J r T ow ] -. 

r^row _ ^^ c °l 

crow crow/' 

— '-'loc V J r ! 

rrow rrow/' rrow\ 

J r — -'loc 1 J t ) 

end loop 
end loop 
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For all sibling pairs {<ti, 0-2} set -B^^ 



R row 

prow 



ccol 

15 loc - 



col 



K 2 



S^-A(I- 



row 

row 
0-2 



jcoh 

(72 / 

jcol\ 

<T1 / 



pcol 
pcol 



R c T o1 = 
S™ 1 = 

rcol _ 



J™ 1 ] = interpolate^™ 1 )*) 

(X? 

loc Wt 
•o\i 
loc* 



ccolf 7C0I 
rcol/' rcol\ 



pcol 
-"loc 



Algorithm 3: Computing the HSS factorization of a non-symmetric matrix. 
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